Social media platforms enable people to exchange their thoughts, reactions, and
emotions regarding all aspects of their lives. Sentiment analysis using textual
data is therefore a widely practiced field. Because of the large volume of textual
content available on social media, sentiment analysis is usually treated as a text
classification task. The high feature dimension is an important issue that needs to
be resolved by examining text meaningfully. The proposed study considers a case
study of coronavirus (COVID) vaccination to draw conclusions about public opinion
on the prospects for vaccination. A text corpus of tweets published between
December 12, 2020, and July 13, 2021 is collected. The proposed model is developed
through a phase-by-phase data analysis process, followed by an assessment of
important information about the collected tweets on the coronavirus disease
(COVID-19) vaccine using two sentiment analyzer methods and probabilistic models
for validation and knowledge analysis. The results indicate that public sentiment
is more positive than negative. The study also presents statistics on trends in
vaccination progress in the top countries from early 2021 to July 2021. The study
has broad scope for sentiment analysis based on keyword and document modeling. The
proposed work offers an effective mechanism for a decision-making system to
understand public opinion and accordingly assists policymakers in health measures
and vaccination campaigns.

Keywords: Natural language processing, Sentiment analysis, Twitter, Vaccination

1. INTRODUCTION

Sentiment analysis is used for many purposes, such as determining the mood of social media users
about a topic, their views on social events, and market price equilibrium [1], [2]. Twitter is widely
preferred as a data source in sentiment analysis studies because it is a popular social network and is
convenient for collecting data in different languages and on different content [3]. The coronavirus
disease (COVID-19) has been one of the trending topics on social media platforms, particularly on
Twitter, since December 2019 and remains so to date. In recent months, vaccines against the
coronavirus were introduced, and people hold different opinions regarding vaccination. Many people are
still hesitant and are making a personal choice regarding vaccination. Many studies were conducted in
recent years to understand possible aspects of people's sentiments on vaccination. However, when
sentiment analysis is treated as a text classification problem and Twitter messages consist of short
text, the dataset becomes sparse [4], [5]. This poses a significant problem in terms of timing and
performance, especially on large-sized data. For this reason, phase-by-phase text representation
techniques are used in the proposed work to solve this performance problem arising from the
high-dimensional feature space.

Journal homepage: http://ijece.iaescore.com
Int J Elec & Comp Eng, ISSN: 2088-8708

The proposed study presents a novel framework for sentiment analysis of COVID-19 vaccines based on
tweets. Various research works on sentiment analysis of social media data in a similar context are
presented in the recent literature [6], [7]. In a similar direction, the authors in [8] collected
tweets over 3 weeks from the European continent to study the increasing impact of the coronavirus. A
recent study presented in [9] focused on the temporal assessment of research on coronavirus using
different datasets with various machine learning (ML) and natural language processing (NLP)
techniques. In [10], the researchers considered Twitter data for analyzing the trend of wearing a
mask. The study demonstrated that people were more serious about wearing masks on March 17 and
July 27, 2020. The researchers in [11] carried out opinion mining using the TextBlob sentiment
analyzer on online knowledge delivery, based on articles web-scraped from blogging websites during the
pandemic. The outcome demonstrated that the blog articles carried positive sentiment more often than
the news articles. A study on COVID-19 vaccination in the Philippines is conducted in [12], where
tweets in English and Filipino were classified with 81.77% accuracy. In [13], Chaudhri et al.
performed sentiment analysis to investigate whether people accept the COVID-19 vaccine. The outcome of
this study exhibited a positive attitude towards receiving vaccine slots. However, this study used
very few tweet samples and did not show the data collected for the analysis. The survey in [14] notes
that effective sentiment analysis for different contexts needs a large text corpus and suitable
techniques. Few research studies consider large text data for topic modeling to analyze various
aspects of COVID-19. Bai et al. [15] introduced a topic development study on COVID-19 news from
Canada. Similarly, topic modeling is carried out to examine news media at an early stage of the
coronavirus in China [16].

Other research studies use a joint approach of topic modeling and sentiment analysis to investigate
the impact of COVID-19. Chandrasekaran et al. [17] and Xue et al. [18] presented studies on the topics
and sentiments of user conversations on Twitter regarding the COVID-19 pandemic. The study in [17]
adopted linear discriminant analysis and the valence aware dictionary and sentiment reasoner (VADER)
to perform sentiment classification. In the same way, Xue et al. [18] applied the latent Dirichlet
allocation (LDA) technique to tweets to analyze public sentiment and mental status during the COVID-19
pandemic. Recently, deep learning has been widely adopted for NLP tasks. One such approach is given in
[19], where Yang et al. developed a learning model that exhibits an increase in positive opinion over
different timescales. In this study, a million multilingual tweets were used and manually annotated
based on different fine-grained emotions. In [20], the fear and panic of people were analyzed based on
tweet conversations. The authors conducted a comparative analysis, which shows that naive Bayes is
superior to logistic regression in sentiment classification. The work in [21] presented a detailed
design of a tweet dataset representing temporal and spatial dimensions for understanding the crisis
caused by the pandemic. The authors introduced network clusters with the identity of people from
different regions. Chakraborty et al. [22] applied a Gaussian fuzzy classifier to social media data to
assess people's sentiments during the initial stage of COVID. The authors have mentioned how
popularity negatively impacts the accuracy of sentiment analysis [23]. The bidirectional encoder
representations from transformers (BERT) model is adopted in [24], [25] to conduct opinion mining on
two different datasets, one consisting of Indian tweets and the other of tweets collected worldwide.
However, this study has not shown any comparative analysis of the model.


2. THE PROBLEM DESCRIPTION 

Based on the literature analysis, it has been identified that significant problems exist that need to
be resolved effectively in order to perform sentiment classification and forecasting in other
contexts. Existing approaches to knowledge extraction depend heavily on supervised learning. The
adoption of supervised learning approaches adds significant computational complexity and delivers
accuracy only at the cost of a larger training dataset. Moreover, most of the existing approaches are
applicable to offline analysis using a complex analytical strategy, which is not cost-effective. Data
transformation is one of the primary steps that contribute to data accuracy in sentiment analysis.
Existing approaches did not emphasize any transformation process and did not demonstrate how the
dataset was collected or what steps were taken to analyze its effectiveness. In this work, all of
these problems are considered, and effective modeling is carried out for sentiment analysis. The
proposed study contributes by addressing the following challenges in particular: i) collecting and
effectively analyzing a large text corpus for sentiment analysis, ii) an exploratory analysis of the
dataset to understand the nature of the data, and iii) sentiment analysis alongside vaccination
progress, shown in different colors over a fixed timeline for the top countries. These points reflect
the adaptiveness of the proposed model. The next subsection presents the proposed solution to the
above-mentioned challenges by introducing an effective and computationally efficient sentiment
analysis model.


2.1. The proposed solution 

The proposed study aims to enhance opinion mining using a supervised learning-based probabilistic
model to better understand public attitudes and emotions towards COVID-19 vaccination at the time it
became widely available in most countries. This kind of analysis can be very effective for gaining
insight into the status of the vaccination campaign and whether people are aware of the situation.
Another significant aspect of the proposed study is that it can provide an effective decision-making
support system concerning health and life security. Therefore, the proposed work introduces a model
for public opinion analysis based on text conversations on social media platforms.

The study collects data from Twitter that users posted to express their sentiments using emojis and
hashtags. The tweets were collected in the English language and then stored in the system database.
The dataset contains unstructured and irrelevant information that is not required in this analysis.
Hence, preprocessing is carried out to remove unwanted data and perform tokenization. The dataset is
labeled based on polarity computation and subjectivity analysis. Afterward, the dataset is split into
two subsets, a training dataset and a testing dataset. The term frequency-inverse document frequency
(TF-IDF) attribute is computed to perform sentiment analysis using a probabilistic model. The
schematic architecture of the proposed framework for opinion mining-based tweet analysis is depicted
in Figure 1.

Figure 1. Illustration of the schematic architecture of the proposed system for public opinion mining


Sentiment analysis plays an important role in deriving insights that help many sectors grow and
improve, such as businesses, markets, healthcare, decision-making, and planning. The proposed study
considers a case study of COVID-19 vaccination, in which sentiment analysis is treated as a text
classification problem. High-dimensional feature space is a significant issue that needs to be
addressed for effective results. Therefore, the study presents an effective sentiment analysis system
that does not suffer from feature extraction problems and provides a compelling analysis of people's
conversations to better understand their sentiments, mental state, the trend of a topic, and needs in
the context of the COVID-19 situation. The significant contributions of the proposed research work
are as follows: i) the study extracted a large corpus of tweet texts on COVID-19 vaccine-related
keywords and terms used globally; ii) the study conducted phase-wise data modeling to prepare the data
for sentiment evaluation; iii) the study implemented two sentiment analysis models to explore their
effectiveness regarding public emotion and sentiment on the COVID-19 vaccine; iv) further, a
supervised probabilistic classifier is implemented to validate the effectiveness of the sentiment
analyzers; v) the study provides sentiment analysis in terms of positive, negative, and neutral
opinions about COVID-19 vaccination, together with time-series statistics of the vaccination process
around the world; and vi) the proposed study provides meaningful and cross-cultural information on the
trend of COVID-19 vaccination and public concern.

The remaining part of this paper is organized as follows: section 3 discusses the proposed method
adopted in system design and development, highlighting the data collection process, dataset modelling,
and sentiment analysis. Section 3.4 discusses the probabilistic model adopted in the study for
knowledge analysis and validation. Section 4 presents the result analysis for the proposed system, and
finally section 5 concludes the overall contribution of the proposed work.

3. PROPOSED METHOD

This section describes the dataset collection procedure and the dataset preprocessing operations. It
also presents the learning models and implementation strategies adopted for sentiment classification
of COVID-19 vaccination tweets, including the mechanism for feature generation. The implementation
method further discusses word embedding for analyzing the text.


3.1. Data collection 

The contextual information for data gathering is acquired from Ritchie et al. [26], which provides
important information such as which countries are using which vaccine and their vaccination progress.
The acquired information acts as a basis for collecting recent tweets. A Twitter account is created
and linked with the Twitter API using the Python library Tweepy to collect tweets regarding COVID-19
vaccination. The Tweepy library takes a search parameter and returns tweets related to public opinion.
This parameter includes frequently used relevant terms (COVID-19, coronavirus, first dose, second
dose, 1 dose, 2 doses) and keywords concerning people's thoughts on the vaccine (such as safe,
harmful, health after vaccination), which are considered in the search process with the help of the
Twitter API filter. The search process also considers the different vaccines used around the world,
such as BioNTech, Sinovac, AstraZeneca, Sinopharm, Moderna, Covaxin, and Sputnik. Only tweets in the
English language published between December 12, 2020, and July 13, 2021 are collected. The retrieved
tweets with different keywords and terms regarding public opinion on the COVID-19 vaccine were merged
into the existing file and stored in the local database under the fields listed in Table 1. In total,
130,036 tweets and 113,743 hashtags were generated by the tweeps. The computing procedure for data
collection using the Twitter API is given in Algorithm 1:


Algorithm 1: Twitter Data Gathering

Input: Tweeps or user profile (P), Tweepy API (T_api), Keyword (K) for Tweets (T)
Output: Tweet data fields (DF)

Start:
  Foreach Tweets from K_list do
    Auth = tweepy.OAuthHandler(P_key, P_secret)
    Auth.set_access_token(access_key, access_secret)
    T_api := f1(Auth, wait_on_rate_limit = True)
    Username ∈ P
    Init max_T ∈ length of tweets
    Foreach Tweets from
      DF = f2(T_api.user_timeline, id = username).items(max_T) do
        K_list = [List of Twitter User Attributes]
        Store_to_local_database do  // using the pandas library
          Tweets = pd.DataFrame(K_list, columns = [DF])
End


The above-mentioned pseudocode demonstrates the computing steps for Twitter data collection using the
Twitter API. 'T_api', linked with the user's Twitter account profile (P), uses two explicit functions,
f1 and f2. The function 'f1' refers to the Tweepy API, and 'f2' refers to the Python library function
for arranging the collected data into a structured format. Table 1 describes the collected data,
including i) the data fields (DF), consisting of different attributes related to tweeps (T) such that
DF ∈ {T_1, T_2, T_3, …, T_12}, ii) the counts (C) of samples of DF such that C(DF) ∈ Z⁺ {1, …, N},
iii) the data type (DT) such that DT ∈ {Int, float, String, Boolean}, and iv) the description
D ∈ {DF}. The next sub-section discusses the process involved in the preprocessing of the acquired
dataset.
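
The keyword-driven search described in section 3.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `quote_term` and `build_search_query` are hypothetical, the term lists are the ones named in the text, and the `lang:en` operator is assumed to be accepted by the search endpoint in use. The resulting string would then be handed to a Tweepy search call, which is left out because it requires account credentials.

```python
# Terms taken from the text of section 3.1.
TOPIC_TERMS = ["COVID-19", "coronavirus", "first dose", "second dose", "1 dose", "2 doses"]
OPINION_TERMS = ["safe", "harmful", "health after vaccination"]
VACCINE_NAMES = ["BioNTech", "Sinovac", "AstraZeneca", "Sinopharm",
                 "Moderna", "Covaxin", "Sputnik"]


def quote_term(term):
    """Wrap multi-word terms in double quotes so they are matched as phrases."""
    return f'"{term}"' if " " in term else term


def build_search_query(terms, language="en"):
    """Join terms with OR and restrict the language (hypothetical helper)."""
    joined = " OR ".join(quote_term(t) for t in terms)
    return f"({joined}) lang:{language}"


query = build_search_query(TOPIC_TERMS + OPINION_TERMS + VACCINE_NAMES)
print(query)
```

Keeping the three term groups separate makes it easy to run one search per group and merge the results afterwards, which matches the merging step described in the text.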


Table 1. Data description

Data field   Count    Data type   Description
Id           130036   float       Id of the Tweeps (user)
name         130036   String      Name of Tweeps
Geography    96603    String      Location of Tweeps
Friends      130036   Int         Contains list of Tweep's friends
Followers    130036   Int         Contains list of Tweep's followers
Verified     130036   bool        Shows authenticity of Twitter account (True/False)
Text         130036   String      Tweets (public opinion)
hashtags     113743   String      Presence of hashtags (#)
date         130036   String      Date of Tweets
UE_source    139630   String      Device information (Android, Apple, and Windows)
retweets     130036   Int         Shows number of re-postings of tweets
is_retweet   130036   bool        Shows presence of re-posting of tweets


3.2. Preprocessing

The proposed study conducted an exploratory analysis to understand the preprocessing requirements of
the collected Twitter dataset. Based on the exploratory analysis, it was observed that the dataset
contains irrelevant characters, punctuation, stop words, data fields, repetitive text, and other
content that is not significant for sentiment analysis. Therefore, a preprocessing operation is
required to discard and clean such irrelevant data from the text field of the dataset. The entire
preprocessing operation is carried out as follows:


3.2.1. Omission of irrelevant data 

In this step of preprocessing, URLs, hashtags (#), special characters, quotes, empty spaces, repeated
words, and punctuation are removed. Along with lowercasing, emoticons are converted to meaningful
text. Duplicate tweets are also removed to make the data more effective for further processing.


3.2.2. Tokenization 

The tweet text is split into smaller units (tokens). The tweets themselves are extracted with Tweepy,
a Python library that offers a simple and straightforward way to retrieve relevant tweets.
Tokenization is carried out to give the model generalization ability and understanding by
interpreting the sequence of text as smaller units.


3.2.3. Omission of stopwords 

Stop words generally carry little information in the text, such as ‘and’, ‘but’, ‘the’, ‘like’, and
many more. To extract text effectively, stop words, which add ambiguity, are removed. These words do
not describe the meaning of the content; they can be ignored without sacrificing the contextual
information of the tweet text.


3.2.4. Normalization 

In this process, the tweet text is normalized to its base form using stemming and lemmatization. The
stemming process reduces words to their stem. For example, ‘tradition’ and ‘traditional’ have the
same stem, ‘tradit’. In lemmatization, words are converted to their base form according to their part
of speech. For example, the word ‘changing’ is converted to the base form ‘change’. The computing
procedure for Twitter dataset preprocessing is given in Algorithm 2:


Algorithm 2: Twitter Data Preprocessing

Input: DF
Output: Preprocessed Tweet_texts
Start:
  Init DF_p ← []
  Load → DF['Geography', 'date', 'text']
  DF['date'] → convert to DateTime
  DF['text'] → Drop duplicate
  def function: Text_preprocessing(text)
    Text_lower = re.findall(text, '(.|[a-z]|[A-Z])') do
      Text = text.lower()
    Text = text.replace_argument(Text, '@\w+', '#', 'RT[\s]+', 'https?://\S+')
    return Preprocessed Tweet_texts
  Foreach Tweet_text from DF do
    DF_p = Text_preprocessing(DF.text)
    DF_p = tokenizer(DF_p)
    DF_p = stem(DF_p)
End


The procedure in Algorithm 2 describes the computing steps for performing the preprocessing operation
on the Twitter dataset. In this process, an empty vector DF_p (preprocessed tweet data fields) is
first initialized to store the preprocessed data for further sentiment analysis. The data field
‘date’ is transformed to the standard representation that shows both the date and time of the tweet
texts. Another data field, ‘geography’, is considered in order to analyze missing fields in the
collected tweet data fields (DF). Further, significant preprocessing is carried out on the tweet
texts using a user-defined function Text_preprocessing( ). This function takes the ‘text’ field of DF
and applies find, remove, and replace operations to specific categories of tweet content. Table 2
summarizes the preprocessing actions taken on the text arguments (i.e., tweet contents). The proposed
study also presents a word cloud in Figure 2 to visualize the frequency of the words after
preprocessing. The word cloud illustrates specific words in the tweet texts regarding the COVID-19
vaccine, such as COVID-19 vaccine, slots, dose, d1, d2, Covaxin, and Sputnik. Words that appear
bigger and bolder are used more frequently and are more significant. After the preprocessing
operation, 129,210 tweet texts remained in the dataset and are used for further analysis. However,
the tweet dataset collected in the proposed work does not contain any labels or classes specific to
the tweets, particularly concerning people's opinions and attitudes. The next sub-section discusses
the tweet text annotation and labeling process for public opinion mining.
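
The cleaning, tokenization, and stop-word steps of section 3.2 can be sketched in plain Python as shown below. This is an illustrative approximation, not the authors' implementation: the function names and the tiny `STOPWORDS` set are hypothetical (a real run would use a full stop-word list), URLs are dropped rather than replaced with a placeholder, and stemming and lemmatization are left out because they need an external NLP library.

```python
import re

# Hypothetical, deliberately small stop-word set for illustration only.
STOPWORDS = {"a", "an", "and", "but", "the", "like", "to", "me", "again", "why", "we"}

URL_RE = re.compile(r"https?://\S+")        # web links
RETWEET_RE = re.compile(r"\bRT\b\s*")       # retweet marker, matched before lowercasing
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")    # anything except letters, digits, whitespace


def clean_text(text):
    """Remove URLs and the RT marker, lowercase, and strip special characters."""
    text = URL_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)       # drops '@', '#', punctuation, quotes
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into whitespace-separated tokens."""
    return text.split()


def remove_stopwords(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(text):
    """Run cleaning, tokenization, and stop-word removal in sequence."""
    return remove_stopwords(tokenize(clean_text(text)))


print(preprocess("Explain to me again why we need a vaccine @BorisJohnson"))
```

Applied to the sample tweet used in Table 2, the sketch keeps the tokens explain, need, vaccine, and borisjohnson.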


Table 2. Summarization of removed and preprocessed tweet contents

Text argument                         Preprocessing action
Special characters (@, !, ?, ., ”)    Removed
URLs                                  Replaced with ‘URL’
#Word                                 Removed
Emoticons                             Replaced with significant meaning
Stopwords                             Removed
Upper case [A-Z]                      Converted to lower case [a-z]

Stage            Example
Original text    Explain to me again why we need a vaccine @BorisJohnson
Tokenized        explain, to, me, again, why, we, need, a, vaccine, BorisJohnson
No_stopwords     explain, need, vaccine, borisjohnson
Stemmed          explain, need, vaccine, borisjohnson
Lemmatized       explain, need, vaccine, borisjohnson

Figure 2. Frequency distribution of Tweet text using text cloud 


3.3. Annotation 

In this phase of implementation, the annotation process is carried out with an unsupervised method of
opinion mining using two well-known Python libraries, TextBlob by Loria [27] and VADER [24], on
129,210 tweet texts. Both VADER and TextBlob come with essential NLP features; VADER performs better
on content such as slang, emojis, conjunctions, and punctuation, while TextBlob works better with
formal words or texts. Both libraries are used for the same purpose of computing metrics such as
subjectivity and polarity. Subjectivity refers to how far a tweet reflects the user's emotion or
opinion rather than factual information. Polarity, on the other hand, determines whether a tweet
expresses positive, negative, or neutral sentiment. The subjectivity metric of each tweet text is
computed from its contextual meaning, i.e., whether it conveys an objective or subjective sense, and
lies within the range [0, 1], where values near 0 mean an objective sense and values near 1 mean a
subjective sense. The polarity metric is computed within the range [-1, 1], where a polarity value
closer to -1 indicates a negative attitude, a value equal to 0 indicates a neutral attitude, and a
value closer to 1 indicates a positive public attitude or Tweep sentiment. The next step in this
implementation is to add a label field next to the sentiment field in the dataset. Labeling follows a
semi-manual process using Python code, in which the sentiment field is first copied to the label
field and each sentiment is then replaced with the corresponding label. Table 3 shows a sample of the
updated dataset with the determined subjectivity and polarity scores; the corresponding sentiments
are then merged into the existing preprocessed tweet data fields (DF). The computing procedure for
annotation and labeling is given in Algorithm 3.

Table 3. A sample representation of tweets dataset with sentiment score and label

Geography           Date                  Text                                            Subjectivity  Polarity  Sentiments  Label
Bengaluru, India    2021-07-17 05:00:00   45 rural bengaluru covid vaccine availability   0.400000      0.20      Positive    1
Singapore           2021-10-07 01:22:00   done job Vaccination done                       0.000000      0.00000   Neutral     0
Vancouver, Canada   2020-12-12 20:23:00   facts immutable                                 0.550000      -0.05000  Negative    2

Algorithm 3: Data Annotation for Opinion Mining
Input: text
Output: sentiments, Labels
Start:
  Foreach Tweet_text from DF do
    get_sub → f3(DF['text'])
    update_DF → DF.append[get_sub] do
      get_pol → f4(DF['text'])
      update_DF → DF.append[get_pol]
  def function: sentiment_analysis(pol_score)
    check (pol_score > 0) do
      sentiment = positive
    else check (pol_score == 0) do
      sentiment = neutral
    Otherwise (sentiment = negative)
  Foreach tweet_text from DF do
    sentiment = sentiment_analysis(DF.Text[pol_score])
    update_DF → DF.append[sentiment] do
      Labelling → [positive: 1, negative: 2, neutral: 0]
End


The above-mentioned algorithm performs the data annotation and labeling process based on the
sentiments determined from the subjectivity (sub) and polarity score (pol_score) in an unsupervised
manner, using two sentiment analysis functions f3 and f4. The function f3 computes the subjectivity
score and f4 the polarity score using TextBlob. A similar computing procedure is also applicable to
VADER. The scatter plot representing the distribution of the subjectivity and polarity scores of the
tweet texts is given in Figure 3. Figure 3 provides a clear view of the sentiment distribution
obtained with TextBlob, where it can be seen that the tweet texts are densely aligned around polarity
score 0, which indicates that more people hold a neutral opinion. In particular, 52.15% of the tweets
express a neutral opinion, 11.73% fall under the category of negative opinion, and 36.12% express a
positive opinion. The sentiment counts obtained from both methods are shown in Figure 4.
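
The thresholding and labeling rule of Algorithm 3 can be written compactly as below. This is a minimal sketch: the function names are hypothetical, the polarity score is assumed to have been computed already by a sentiment analyzer such as TextBlob or VADER, and the counts are the TextBlob figures reported in this section.

```python
# Label mapping as given in Algorithm 3.
LABELS = {"positive": 1, "negative": 2, "neutral": 0}


def sentiment_from_polarity(pol_score):
    """Map a polarity score in [-1, 1] to a sentiment name."""
    if pol_score > 0:
        return "positive"
    if pol_score == 0:
        return "neutral"
    return "negative"


def label_from_polarity(pol_score):
    """Map a polarity score directly to its numeric label."""
    return LABELS[sentiment_from_polarity(pol_score)]


def shares(counts):
    """Convert sentiment counts into percentages rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# TextBlob sentiment counts reported in the paper.
textblob_counts = {"positive": 46665, "negative": 15162, "neutral": 67383}
print(shares(textblob_counts))
```

The three counts add up to the 129,210 preprocessed tweets, and the resulting shares agree with the percentages quoted for TextBlob.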

Figure 3. Scatter plot for subjectivity vs. polarity

Figure 4. Statistics of sentiments regarding tweets texts


The statistics in Figure 4 show the sentiments obtained from the TextBlob and VADER unsupervised
methods. Based on the analysis, TextBlob yields more positive and neutral sentiments than VADER. With
TextBlob, 46,665 tweet texts reflect positive sentiment, 15,162 tweet texts indicate negative
sentiment, and 67,383 tweet texts reflect a neutral attitude. With VADER, 21,689 tweets represent
negative sentiment, while 64,609 and 42,912 tweet texts indicate neutral and positive sentiment,
respectively. Both techniques produce similar results for positive and neutral sentiments, but the
results differ considerably for negative sentiment. The study also shows statistics demonstrating the
current trend of vaccination among different countries in Figure 5.

Figure 5. Statistics of vaccination progress


3.4. Validation with supervised learning mechanism using probabilistic model and SVM 
Analysis of the extracted knowledge plays a major role, and the extracted opinions also need to be
considered. To obtain knowledge analysis and more insight into the performance of both opinion mining
methods, i.e., TextBlob and VADER, this section presents an implementation of the supervised learning
mechanism using naive Bayes and support vector machine (SVM) classifiers.


3.4.1. Naive Bayes 
Naive Bayes is a probabilistic model based on Bayes' theorem, expressed numerically in (1):

$$P(M \mid N) = \frac{P(N \mid M)\,P(M)}{P(N)} \tag{1}$$

where P(M) denotes the prior probability of presumption M being true, P(N) refers to the probability
of the data, P(M|N) is the probability of presumption M given the available data N, and P(N|M)
denotes the probability of N given that presumption M is true. The proposed study explores public
sentiments using text-based data. Therefore, for a collection of tweet text documents
T ∈ {T_1, T_2, T_3, …, T_N} with a set of preassigned sentiment classes L ∈ {l_1, l_2, l_3}, the task
is to select a classification function f that produces the correct sentiment for each input document,
such that f(T_i) = L. In this regard, it becomes vital to compute the probability of all possible
values of M and to predict the label with the maximum probability, which can be numerically expressed
in (2):

$$M = \arg\max_{M} P(M) \prod_{i} P(N_i \mid M) \tag{2}$$
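
The decision rule in (2) can be illustrated with a small multinomial naive Bayes written from scratch. This is a sketch under stated assumptions rather than the classifier used in the study: the function names and the four toy documents are hypothetical, the smoothing parameter `alpha` is additive (Laplace) smoothing, and log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict_nb(model, doc):
    """Return the class maximizing log P(M) + sum_i log P(N_i | M)."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy, hypothetical training data: label 1 = positive, label 2 = negative.
docs = [["safe", "vaccine"], ["good", "dose"], ["harmful", "vaccine"], ["bad", "effects"]]
labels = [1, 1, 2, 2]
model = train_nb(docs, labels)
print(predict_nb(model, ["safe", "dose"]))
```

Words that never occurred in training are skipped at prediction time, which is one common convention; other implementations handle unseen words differently.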


To train the classifier, the dataset is split in the ratio of 70% for training and 30% for testing.
Further, the study performs term frequency-inverse document frequency (TF-IDF) vectorization to
represent tweet text as word vectors suitable as input for the naive Bayes classifier. The term
frequency can be computed using (3):

$$TF(T, W) = \frac{T}{D} \tag{3}$$

where T denotes the count of appearances of the word and D represents the number of words in the text
document. The term frequency (TF) treats all words as equally significant, whereas the inverse
document frequency (IDF) considers only the unique terms and can be represented numerically in (4):

$$IDF(T) = \log\left(\frac{N}{W}\right) \tag{4}$$

The expression in (4) measures the significance of W, where N is the number of appearances of the
words. However, W can be zero; to avoid division by zero, expressions (5) and (6) are used:

$$IDF(T) = \log\left(\frac{N}{W + 1}\right) \tag{5}$$

In (5), 1 is added to the frequency of the word in the text documents (DF). The final expression can
be written as:

$$TF\text{-}IDF(T, W) = TF(T, W) \times IDF(T) \tag{6}$$
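
The TF-IDF weighting in (3) to (6) can be sketched in a few lines of Python. The sketch assumes the conventional reading of the symbols, which the text does not spell out: N is taken as the number of documents and W as the number of documents containing the word. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: count of each word divided by the number of words in the document."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


def inverse_document_frequency(corpus):
    """Smoothed IDF: log(N / (W + 1)), with W the document frequency of a word."""
    n_docs = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / (df + 1)) for w, df in doc_freq.items()}


def tf_idf(corpus):
    """Return one {word: weight} dictionary per document."""
    idf = inverse_document_frequency(corpus)
    return [{w: tf * idf[w] for w, tf in term_frequency(doc).items()} for doc in corpus]


# Toy, hypothetical corpus of tokenized tweets.
corpus = [["vaccine", "safe"], ["vaccine", "harmful"], ["vaccine", "dose"]]
weights = tf_idf(corpus)
print(weights[0])
```

With this form of smoothing, a word that occurs in every document receives a negative weight, so rare words such as "safe" rank above the ubiquitous "vaccine". Library vectorizers often use a slightly different smoothing that keeps all weights non-negative.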


3.4.2. Support vector machine (SVM) 

The SVM is a popular supervised learning mechanism based on the vector principle that is used to
address classification and regression problems. An advantage of SVM is that it is less prone to
overfitting than many other machine learning classifiers. Therefore, the proposed study implements an
SVM classifier to perform a comparative analysis with the naive Bayes technique. Since the proposed
system is concerned with multiple sentiment polarities regarding public opinion or attitude towards
COVID vaccination, the study considers the implementation of a multi-class SVM, as shown in Figure 6.

Int J Elec & Comp Eng, Vol. 12, No. 4, August 2022: 4054-4066

Figure 6 demonstrates the multiclass SVM, where the input tweet text T ∈ {T_1, T_2, T_3, …, T_N} is
mapped to an output class, i.e., positive (Pos), neutral (Neu), or negative (Neg), using a linear
function f(x) = (w · φ(x)) + b, where w and b are the weight and bias, respectively, and φ(x) denotes
the mapping function of the SVM kernel (K). In the implemented multiclass SVM classifier, the proposed
study uses two SVM kernels (K1 and K2) connected in series, where each provides its output with a
single polarity class based on the highest approximated values. However, the use of multiple SVM
kernels may pose computational overhead. In this regard, a transition matrix approach is used, which
acts as an auxiliary mechanism for each SVM kernel in processing text data from the observation state
to the decision state. Also, feature learning performance in the training phase depends on the kernel
function used in the SVM classifier. Although various kernel functions are available for the SVM, the
radial basis kernel function is chosen for the SVM implementation based on empirical analysis and is
given numerically as:


$$ k(x, y) = \exp\left(-\frac{\|x - y\|^2}{\gamma}\right) \qquad (7) $$


where ||x - y||² denotes the squared Euclidean distance between two data points x and y, and γ refers 
to a hyperparameter such that γ = 2σ², where σ² refers to the variance. The core objective of this kernel 
function is to compute the similarity or closeness of two data points to each other. The next section discusses 
the performance analysis of the implemented classifiers for public opinion analysis. 
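
As an illustration of the radial basis kernel in (7), the following sketch evaluates it for two points. It assumes the usual Gaussian form with γ = 2σ² in the denominator, as described above; the function name `rbf_kernel` and the example vectors are hypothetical and not taken from the study.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / gamma), with gamma = 2 * sigma**2 (assumed form)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    gamma = 2.0 * sigma ** 2
    return math.exp(-squared_distance / gamma)


# Identical points are maximally similar; similarity decays with distance.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))
print(rbf_kernel([0.0, 0.0], [3.0, 3.0]))
```

Some libraries use a different convention: scikit-learn, for example, writes the same kernel as exp(-gamma * ||x - y||²), so its `gamma` corresponds to 1 / (2σ²) rather than to 2σ².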


Figure 6. Illustration of the classifier with multiple SVM kernels 


4. RESULTS ANALYSIS AND DISCUSSION 

The entire modeling and development of the proposed system are carried out in the Anaconda computing 
environment using Python. Naive Bayes and SVM classifiers, both supervised learning mechanisms, are 
considered for public opinion mining regarding COVID-19 vaccination. The multinomial naive Bayes (MNB) 
classifier is selected as it can handle a large tweet text corpus and is suitable for sentiment prediction based on 
text data. SVM is selected as it is less prone to overfitting problems and is suitable for feature learning in 
high-dimensional space. The implemented classifiers are trained on the preprocessed dataset, 
where 70% of the dataset, i.e., 90,447 samples, was selected out of a total of 129,210 samples that remained after 
the preprocessing operation. The MNB classifier is trained and tuned by adjusting its 
hyperparameter alpha (α), which handles the problem of zero probability and performs smoothing in the 
training process. A grid search strategy with cross-validation is adopted to obtain the optimal α value and to 
assess the model performance. In the case of SVM, the classifier is tuned with the hyperparameters 
C equal to 10 (which acts as a regularizer to control error), gamma equal to 0.0001, and kernel equal to radial basis 
function (RBF). Since the dataset is preprocessed, it is not associated with any imbalance factor or skewness, 
and it has an equal number of sentiment labels. Therefore, the study considers accuracy as the primary metric 
for the performance assessment. 
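
The role of α can be illustrated with a small, self-contained sketch of multinomial naive Bayes with additive smoothing. This is not the study's implementation: the class `TinyMultinomialNB`, the toy tweets, and the α grid are hypothetical, and a single held-out split stands in for the cross-validated grid search described above.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial naive Bayes with additive (alpha) smoothing."""

    def __init__(self, alpha=1.0):
        if alpha <= 0:
            raise ValueError("alpha must be positive to avoid log(0)")
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        self.n_docs = len(docs)
        return self

    def token_prob(self, token, label):
        """P(token | label); alpha keeps unseen tokens from getting probability 0."""
        counts = self.token_counts[label]
        total = sum(counts.values())
        return (counts[token] + self.alpha) / (total + self.alpha * len(self.vocab))

    def predict(self, doc):
        scores = {}
        for label in self.classes:
            score = math.log(self.class_counts[label] / self.n_docs)
            for tok in doc:
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log(self.token_prob(tok, label))
            scores[label] = score
        return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(model.predict(doc) == label for doc, label in zip(docs, labels))
    return hits / len(labels)


# Hypothetical, already tokenized tweets with balanced labels.
train_docs = [
    ["great", "vaccine"], ["safe", "and", "great"],
    ["bad", "side", "effects"], ["bad", "reaction"],
    ["got", "vaccine", "today"], ["appointment", "today"],
]
train_labels = ["pos", "pos", "neg", "neg", "neu", "neu"]
val_docs = [["great", "vaccine"], ["bad", "reaction"]]
val_labels = ["pos", "neg"]

# Hypothetical alpha grid; the grid used in the study is not stated.
alpha_grid = [0.1, 0.5, 1.0]
scores = {
    a: accuracy(TinyMultinomialNB(a).fit(train_docs, train_labels), val_docs, val_labels)
    for a in alpha_grid
}
best_alpha = max(scores, key=scores.get)
print(scores, best_alpha)
```

Without smoothing, a word such as "great" that never occurs in negative tweets would receive probability zero for the negative class and cancel every other piece of evidence; with α > 0 it receives a small positive probability instead.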


4.1. Outcome analysis 

In this section, the performance analysis is carried out for both supervised learning classifiers 
considering the output obtained from the training datasets prepared with the TextBlob and VADER 
sentiment analyzers. Based on the analysis from Figure 7, it has been found that MNB scored the highest 
accuracy (91.69%) for predicting sentiments of people regarding the COVID vaccine, whereas SVM 
achieved a slightly lower accuracy score (91.19%) than MNB in the case of the TextBlob sentiment analyzer. In 
the case of the other sentiment analyzer, i.e., VADER, SVM outperforms MNB by achieving a higher accuracy 
score (86.63%) than MNB (84.22%). Based on the overall analysis, it seems that TextBlob is better than 
VADER, and SVM also seems to be better than MNB. 

Figure 7. Sentiment classification accuracy (%) with the probabilistic model naïve bayes 


4.2. Discussion 

To get the best results, proper and appropriate training must be considered; therefore, the classifiers used 
need to be trained. It is to be noted that both classifier models are trained on the datasets prepared 
with the two different sentiment analyzers, i.e., TextBlob and VADER. Therefore, the performance analysis is 
carried out for both the classifiers and the sentiment analyzers. 


4.2.1. Analysis regarding sentiment analyzer 

Both sentiment analyzers provide a wide range of features for the text classification problem. 
However, both techniques are associated with some advantages and disadvantages. The main advantage of 
using TextBlob is that it can efficiently handle large amounts of text data without posing much computational overhead. 
The VADER sentiment analyzer has a wide range of features concerning sentiment lexicons. However, it does not have a 
large variety of other features, and thus users have to depend on other libraries for advanced tasks. It also 
poses a slightly higher computational overhead compared with TextBlob. VADER best suits text that 
contains slang and emojis, whereas TextBlob is better suited to plain and formal text. The 
TextBlob sentiment analyzer performs comparatively better than VADER because our dataset is preprocessed and 
does not contain emoticons and slang. 


4.2.2. Analysis of supervised classifiers 

It is quite difficult to say which of the implemented classifiers, SVM or MNB, is 
better for text classification, since a closer analysis shows that both classifiers have achieved similar 
performance. However, SVM can be considered the most suitable as it is not susceptible to 
catastrophic failures. It better correlates the similarity in the data from the large text corpus, making its 
generalization and feature representation better and more accurate, leading to better performance in sentiment 
polarity score prediction. In the case of MNB, however, the outcomes are not found to be consistent enough. Based 
on the overall analysis, TextBlob and SVM perform better than VADER and MNB in analyzing 
people's opinions or sentiments regarding the COVID-19 vaccination. 
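
The overall comparison can be checked with simple arithmetic on the accuracy scores reported in section 4.1. The sketch below only averages those four reported values; the dictionary layout and the helper name `mean_by` are illustrative choices, not part of the study.

```python
# Accuracy (%) reported in section 4.1 for each (analyzer, classifier) pair.
reported_accuracy = {
    ("TextBlob", "MNB"): 91.69,
    ("TextBlob", "SVM"): 91.19,
    ("VADER", "MNB"): 84.22,
    ("VADER", "SVM"): 86.63,
}


def mean_by(position):
    """Average accuracy grouped by analyzer (position 0) or classifier (position 1)."""
    groups = {}
    for key, acc in reported_accuracy.items():
        groups.setdefault(key[position], []).append(acc)
    return {name: sum(values) / len(values) for name, values in groups.items()}


by_analyzer = mean_by(0)
by_classifier = mean_by(1)
print(by_analyzer)
print(by_classifier)

# Drop in accuracy when moving from TextBlob labels to VADER labels.
drop = {
    clf: reported_accuracy[("TextBlob", clf)] - reported_accuracy[("VADER", clf)]
    for clf in ("MNB", "SVM")
}
print(drop)
```

On these figures, TextBlob averages 91.44% against 85.425% for VADER, and SVM averages 88.91% against 87.955% for MNB. The accuracy of MNB drops by about 7.47 points between the two analyzers, compared with about 4.56 points for SVM, which is in line with the remark that SVM behaves more consistently.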


4.2.3. Findings and scope of the study 

The major contribution of the proposed work is that it provides a steppingstone to explore and analyze 
public sentiment based on conversations about COVID-19 vaccination on social media platforms. The 
study findings suggest that public sentiment is more positive than negative. However, most 
sentiments are observed to be neutral. It has also been found that people are conscious about their health and 
lifestyle after vaccination and express this with positive sentiments. The good news is that under no circumstances 
does negative sentiment exceed positive sentiment. The study also presented the time series statistics of 
the vaccination trend for the topmost countries between January 2021 and July 2021. This clearly shows how 
people from different countries have welcomed the vaccine against the novel coronavirus. Apart from this, 
the proposed system also has further scope, such as analyzing other topics like wearing a mask, social 
distancing, traveling, popular vaccines, and many more. The proposed study provides meaningful and 
cross-cultural information on the trend of COVID-19 vaccination and public concern. 

5. CONCLUSION 

This paper has presented a sentiment forecast model for analyzing people's attitudes regarding 
COVID-19 vaccination. The design and development of the proposed model are carried out considering 
phase-wise data modeling based on the collection of a large tweet text corpus, followed by an assessment of the sentiments of 
people regarding the COVID-19 vaccine using two sentiment lexicon methods and probabilistic models for 
knowledge analysis. This work is focused on facilitating a model that can perform sentiment analysis for public 
opinion mining and support an effective decision-making process in the context of healthcare and a healthy 
lifestyle. The study validated the adopted sentiment methods with two supervised learning approaches, namely 
MNB and SVM. It also showed a comparison based on the obtained results so that a suitable mechanism can 
be adopted for sentiment analysis. Based on the comparative analysis, SVM is superior to MNB, and 
TextBlob is better than the VADER sentiment analyzer. The work also serves as a guideline to help public health officials and 
policymakers provide the public with necessary services and resources, and it provides effective 
information to government or health officials to better understand vaccination activities. In future work, the 
proposed work may be extended to more robust designs that support deep learning mechanisms or 
artificial intelligence and focus on the stability and security aspects to be suitable for real-time deployment 
scenarios. 